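Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in their pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose \textbf{LayoutLMv3} to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at \url{https://aka.ms/layoutlmv3}.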
translated by 谷歌翻译
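The word-patch alignment objective described above can be illustrated with a small labeling routine. This is only a sketch under stated assumptions, not the authors' implementation: the function name, the `word_to_patch` mapping, and the choice to skip words whose text token is itself masked are all hypothetical details introduced here for illustration.

```python
from typing import List, Optional


def word_patch_alignment_labels(
    word_to_patch: List[int],
    patch_masked: List[bool],
    word_masked: Optional[List[bool]] = None,
) -> List[Optional[int]]:
    """Return one label per word: 1 if its image patch is masked, else 0.

    Words whose text token is itself masked get None (assumption: such
    words are excluded from the alignment loss).
    """
    if word_masked is None:
        word_masked = [False] * len(word_to_patch)
    labels: List[Optional[int]] = []
    for patch_idx, text_masked in zip(word_to_patch, word_masked):
        if text_masked:
            labels.append(None)  # skipped by the (assumed) loss
        else:
            labels.append(1 if patch_masked[patch_idx] else 0)
    return labels
```

A binary classifier over each word's representation could then be trained against these labels, ignoring the `None` entries.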
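Image Transformers have recently achieved significant progress in natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose \textbf{DiT}, a self-supervised pre-trained \textbf{D}ocument \textbf{I}mage \textbf{T}ransformer model using large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts exist due to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection, and text detection for OCR. Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g., document image classification (91.11 $\rightarrow$ 92.69), document layout analysis (91.0 $\rightarrow$ 94.9), table detection (94.23 $\rightarrow$ 96.55), and text detection for OCR (93.07 $\rightarrow$ 94.29). The code and pre-trained models are publicly available at \url{https://aka.ms/msdit}.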
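Document AI, or Document Intelligence, is a relatively new research topic that refers to techniques for automatically reading, understanding, and analyzing business documents. It is an important research direction for natural language processing and computer vision. In recent years, the popularity of deep learning technology has greatly advanced the development of Document AI in areas such as document layout analysis, visual information extraction, document visual question answering, and document image classification. This paper briefly reviews some representative models, tasks, and benchmark datasets. Furthermore, we introduce early-stage heuristic rule-based document analysis, statistical machine learning algorithms, and deep learning approaches, especially pre-training methods. Finally, we look into future directions for Document AI research.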
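Text recognition is a long-standing research problem in document digitalization. Existing approaches are usually built on a CNN for image understanding and an RNN for char-level text generation. In addition, another language model is usually needed as a post-processing step to improve the overall accuracy. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on printed, handwritten, and scene text recognition tasks. The TrOCR models and code are publicly available at \url{https://aka.ms/trocr}.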
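Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which help it better capture cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 $\to$ 0.8420), CORD (0.9493 $\to$ 0.9601), SROIE (0.9524 $\to$ 0.9781), Kleister-NDA (0.8340 $\to$ 0.8520), RVL-CDIP (0.9443 $\to$ 0.9564), and DocVQA (0.7295 $\to$ 0.8672). We make our model and code publicly available at \url{https://aka.ms/layoutlmv2}.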
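One common way to make self-attention aware of relative positions is to add a learned bias, indexed by the clipped offset between two text blocks, to each content score before the softmax. The sketch below illustrates that general idea only; it is not taken from the LayoutLMv2 code, and the function name, the integer block coordinates `xs` and `ys`, the bias tables, and the clipping range `max_rel` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_aware_attention_weights(
    queries: List[List[float]],
    keys: List[List[float]],
    xs: List[int],
    ys: List[int],
    bias_x: List[float],
    bias_y: List[float],
    max_rel: int = 2,
) -> List[List[float]]:
    """Attention weights with additive relative-position biases.

    bias_x and bias_y each hold 2 * max_rel + 1 entries, one per clipped
    offset in [-max_rel, max_rel] (hypothetical parameterization).
    """
    d = len(queries[0])

    def bucket(delta: int) -> int:
        # Clip the offset and shift it to a valid table index.
        return max(-max_rel, min(max_rel, delta)) + max_rel

    weights = []
    for i, q in enumerate(queries):
        row = []
        for j, k in enumerate(keys):
            content = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            spatial = bias_x[bucket(xs[j] - xs[i])] + bias_y[bucket(ys[j] - ys[i])]
            row.append(content + spatial)
        weights.append(softmax(row))
    return weights
```

With all biases set to zero this reduces to ordinary scaled dot-product attention weights, which makes the effect of the spatial terms easy to isolate.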
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
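To make the idea of a style code adjusting feed-forward weights concrete, the sketch below mixes several candidate weight matrices using coefficients computed from the style code. This is one plausible realization written for illustration, not the authors' architecture: the gating matrix `gate`, the list of candidate matrices `experts`, and the softmax mixing rule are all assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def style_adaptive_linear(
    x: List[float],
    style_code: List[float],
    gate: Matrix,
    experts: List[Matrix],
) -> List[float]:
    """Apply a linear layer whose weights depend on the style code.

    gate has one row per candidate matrix; its dot product with the style
    code gives that candidate's mixing logit (hypothetical design).
    """
    logits = [sum(g * s for g, s in zip(row, style_code)) for row in gate]
    coeffs = softmax(logits)
    rows, cols = len(experts[0]), len(experts[0][0])
    # Blend the candidate matrices into a single style-specific weight.
    mixed = [
        [sum(c * w[r][col] for c, w in zip(coeffs, experts)) for col in range(cols)]
        for r in range(rows)
    ]
    return [sum(mixed[r][col] * x[col] for col in range(cols)) for r in range(rows)]
```

Because the mixing coefficients change with the style code, the same input features pass through different effective weights for different reference styles.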
Learning the underlying distribution of molecular graphs and generating high-fidelity samples is a fundamental research problem in drug discovery and material science. However, accurately modeling distribution and rapidly generating novel molecular graphs remain crucial and challenging goals. To accomplish these goals, we propose a novel Conditional Diffusion model based on discrete Graph Structures (CDGS) for molecular graph generation. Specifically, we construct a forward graph diffusion process on both graph structures and inherent features through stochastic differential equations (SDE) and derive discrete graph structures as the condition for reverse generative processes. We present a specialized hybrid graph noise prediction model that extracts the global context and the local node-edge dependency from intermediate graph states. We further utilize ordinary differential equation (ODE) solvers for efficient graph sampling, based on the semi-linear structure of the probability flow ODE. Experiments on diverse datasets validate the effectiveness of our framework. Particularly, the proposed method still generates high-quality molecular graphs in a limited number of steps.
Despite some successful applications of goal-driven navigation, existing deep reinforcement learning (DRL)-based approaches notoriously suffer from poor data efficiency. One of the reasons is that the goal information is decoupled from the perception module and directly introduced as a condition of decision-making, resulting in the goal-irrelevant features of the scene representation playing an adversarial role during the learning process. In light of this, we present a novel Goal-guided Transformer-enabled reinforcement learning (GTRL) approach that takes the physical goal states as an input to the scene encoder, guiding the scene representation to couple with the goal information and realizing efficient autonomous navigation. More specifically, we propose a novel variant of the Vision Transformer as the backbone of the perception system, namely the Goal-guided Transformer (GoT), and pre-train it with expert priors to boost data efficiency. Subsequently, a reinforcement learning algorithm is instantiated for the decision-making system, taking the goal-oriented scene representation from the GoT as input and generating decision commands. As a result, our approach motivates the scene representation to concentrate mainly on goal-relevant features, which substantially enhances the data efficiency of the DRL learning process, leading to superior navigation performance. Both simulation and real-world experimental results demonstrate the superiority of our approach in terms of data efficiency, performance, robustness, and sim-to-real generalization, compared with other state-of-the-art baselines. Demonstration videos are available at https://youtu.be/93LGlGvaN0c.
Deep neural networks (DNNs) are found to be vulnerable to adversarial attacks, and various methods have been proposed for defense. Among these methods, adversarial training has been drawing increasing attention because of its simplicity and effectiveness. However, the performance of adversarial training is greatly limited by the architectures of the target DNNs, which often leaves the resulting DNNs with poor accuracy and unsatisfactory robustness. To address this problem, we propose DSARA to automatically search for neural architectures that are accurate and robust after adversarial training. In particular, we design a novel cell-based search space specifically for adversarial training, which improves the accuracy and the robustness upper bound of the searched architectures by carefully designing the placement of the cells and the proportional relationship of the filter numbers. Then we propose a two-stage search strategy to search for both accurate and robust neural architectures. In the first stage, the architecture parameters are optimized to minimize the adversarial loss, which makes full use of the effectiveness of adversarial training in enhancing robustness. In the second stage, the architecture parameters are optimized to minimize both the natural loss and the adversarial loss using the proposed multi-objective adversarial training method, so that the searched neural architectures are both accurate and robust. We evaluate the proposed algorithm on natural data and under various adversarial attacks, which demonstrates the superiority of the proposed method in finding architectures that are both accurate and robust. We also conclude that accurate and robust neural architectures tend to deploy very different structures near the input and the output, which is of great practical significance for both the hand-crafting and the automatic design of accurate and robust neural architectures.
A crucial issue with current text generation models is that they often uncontrollably generate text that is factually inconsistent with their inputs. Limited by the lack of annotated data, existing work on evaluating factual consistency directly transfers the reasoning ability of models trained on other data-rich upstream tasks, such as question answering (QA) and natural language inference (NLI), without any further adaptation. As a result, these methods perform poorly on real generated text and are heavily biased by their single-source upstream tasks. To alleviate this problem, we propose a weakly supervised framework that aggregates multiple resources to train a precise and efficient factual metric, namely WeCheck. WeCheck first utilizes a generative model to accurately label a real generated sample by aggregating its weak labels, which are inferred from multiple resources. Then, we train the target metric model with the weak supervision while taking noise into consideration. Comprehensive experiments on a variety of tasks demonstrate the strong performance of WeCheck, which achieves a 3.4\% absolute improvement on average over previous state-of-the-art methods on the TRUE benchmark.